Rubric Comparison Report

Comparing precomputed rubrics from O3, O4-Mini, and GPT-5 across all tasks

Total Tasks: 677
Segments: 12
compositional_tasks_v2 (87 tasks)
flights (51 tasks)
hotels_head (52 tasks)
jobs (38 tasks)
price_comparison (57 tasks)
realestate_complex (48 tasks)
recipe_to_shopping (48 tasks)
restaurants_tail (52 tasks)
shopping_head (56 tasks)
shopping_lists_tail (51 tasks)
things_to_do (80 tasks)
ticketing (57 tasks)